feat(filemanager): ingest_id tagging and object move tracking #585
…ting a moved object
…name ingest_id tag
Did you test that with any of the BYOB buckets?
Good point, I haven't tested on BYOB - I'll PR that on the infra repo.
the same object that has been moved, or two different objects. This is because S3 only tracks `Created` or `Deleted`
events.

To track moved objects, the filemanager stores additional information in S3 tags, that gets copied when the object
doc nit:
"stores additional information in S3 tags. The tag field X gets updated when the object is moved."
and/or "see below for tag key-value details"
Sounds good to me
The object tagging mechanism also doesn't differentiate between moved objects and copied objects with the same tags.
If an object is copied with tags, the `ingest_id` will also be copied and the above logic will apply.

## Alternative designs
Nice doc!
I'd probably add another small note on the checksum approach: it can't be used if the checksums are not expected to be the same, e.g. with compression, which is a big use case for us.
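To illustrate the compression caveat, here is a minimal sketch. It assumes SHA-256 as the checksum and gzip as the compression, both chosen only for illustration and not taken from the filemanager. The stored bytes change once the object is compressed, so a checksum match cannot link the two objects:

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical object contents, for illustration only.
original = b"ACGT" * 1024
compressed = gzip.compress(original)

# Same logical content, different stored bytes, so the checksums differ.
checksums_match = sha256_hex(original) == sha256_hex(compressed)

# Only after decompressing does the checksum line up with the original again.
roundtrip_match = sha256_hex(gzip.decompress(compressed)) == sha256_hex(original)
```

A tag-based identifier does not depend on the bytes, which is why it still works in this case.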
The new tag is also stored in the `ingest_id` column.
* The database is also queried for any records with the same `ingest_id` so that attributes can be copied to the new record.
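
The lookup described in the bullet above could be sketched roughly as follows. This is an illustration only: it uses an in-memory SQLite database, and the table name `s3_object` and the `key` and `attributes` columns are hypothetical; only the `ingest_id` column comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the ingest_id column is named in the source text.
conn.execute("CREATE TABLE s3_object (key TEXT, ingest_id TEXT, attributes TEXT)")
conn.execute(
    "INSERT INTO s3_object (key, ingest_id, attributes) VALUES (?, ?, ?)",
    ("old/key", "id-1", '{"project": "demo"}'),
)


def ingest_moved(conn, new_key, ingest_id):
    """Insert a record for new_key, copying attributes from the most recent
    existing record that carries the same ingest_id, if there is one."""
    row = conn.execute(
        "SELECT attributes FROM s3_object "
        "WHERE ingest_id = ? AND attributes IS NOT NULL "
        "ORDER BY rowid DESC LIMIT 1",
        (ingest_id,),
    ).fetchone()
    attributes = row[0] if row else None
    conn.execute(
        "INSERT INTO s3_object (key, ingest_id, attributes) VALUES (?, ?, ?)",
        (new_key, ingest_id, attributes),
    )
    return attributes
```

Calling `ingest_moved(conn, "new/key", "id-1")` carries the attributes over to the new record, while an unseen `ingest_id` yields a record with no attributes.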

This logic is enabled by default, but it can be switched off by setting `FILEMANAGER_INGESTER_TRACK_MOVES`. The filemanager
Tagging is currently part of the ingestion process, right?
So there's a possibility that this may slow down the ingestion and may become an issue under heavy load?
Not for now, but if that should become the case, we could think of an async tagging strategy.
Given the option to disable tagging (or tagging failing/missing for other reasons), it would be great to think of an async tagging option.
Yeah I agree with this, it could potentially slow things down as it's done on ingestion. It's a bit tricky because the act of tagging the object conveys the information of the move - ideally this would be done as soon as possible (i.e. on ingestion). Anything async would extend the window that the object isn't tagged, meaning that the move can't be tracked. In practice this probably wouldn't make a difference if the object isn't moved as soon as it's created.
There are `s3:ObjectTagging:*` events which might be good for this that I'll look into.
This is always a challenging tradeoff. Let's give it a shot!
Yes, tricky, but that's a general "issue" with event-based systems: there's an inevitable delay/asynchronicity.
And I am not saying we should implement that now. An open ticket or comment in the code to keep track of it is perfectly fine.
To compensate for potential concurrency issues, the mentioned support of checksums, name matches, etc. could be used... at least to some extent. All future considerations... all good for now!
Closes #584
Mechanism
* … `ingest_id`.
* … `ingest_id` is reused, which allows creating the sequence of records representing the moved object.

Implementation
* … `GetObjectTagging` and `PutObjectTagging` capabilities to filemanager.
* … `ingest_id` column to database which matches the `ingest_id` on the S3 object tags.
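
For readers unfamiliar with the tagging calls, the flow could look roughly like the sketch below. It is not the filemanager's code. `InMemoryS3` is a hypothetical stand-in that mirrors the keyword arguments of boto3's `get_object_tagging` and `put_object_tagging`, and both the literal tag key `ingest_id` and the use of a UUID as its value are assumptions made for illustration.

```python
import uuid

# Assumption: the tag key is literally "ingest_id".
INGEST_ID_TAG = "ingest_id"


class InMemoryS3:
    """Hypothetical stand-in for an S3 client, mirroring boto3's tagging call shapes."""

    def __init__(self):
        self._tags = {}

    def get_object_tagging(self, Bucket, Key):
        return {"TagSet": list(self._tags.get((Bucket, Key), []))}

    def put_object_tagging(self, Bucket, Key, Tagging):
        # As in S3, a put replaces the object's whole tag set.
        self._tags[(Bucket, Key)] = list(Tagging["TagSet"])
        return {}


def ensure_ingest_id(client, bucket, key):
    """Return the object's ingest_id, tagging the object first if it has none."""
    tag_set = client.get_object_tagging(Bucket=bucket, Key=key)["TagSet"]
    for tag in tag_set:
        if tag["Key"] == INGEST_ID_TAG:
            # Already tagged (for example a moved or copied object): reuse the id.
            return tag["Value"]
    ingest_id = str(uuid.uuid4())
    # Keep the existing tags, because the put overwrites the full tag set.
    new_tags = tag_set + [{"Key": INGEST_ID_TAG, "Value": ingest_id}]
    client.put_object_tagging(Bucket=bucket, Key=key, Tagging={"TagSet": new_tags})
    return ingest_id
```

Because the existing tag is reused whenever it is present, an object that arrives at a new key with its tags intact resolves to the same `ingest_id` as the original record.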